A formalism for studying long-range correlations in 
many-alphabets sequences. 

S. L. Narasimhan*, Joseph A. Nathan+, P. S. R. Krishna and K. P. N. Murthy**
Solid State Physics Division, + Reactor Physics Design Division,
Bhabha Atomic Research Centre, Mumbai-400085, India.
** Materials Science Division, Indira Gandhi Centre for Atomic Research,
Kalpakkam 603102, Tamilnadu, India.

Abstract 

We formulate a mean-field-like theory of long-range correlated L-alphabets sequences, which
are actually systems with (L - 1) independent parameters. Depending on the values of these
parameters, the variance of the number of any given symbol in the sequence shows a
linear or a superlinear dependence on the total length of the sequence. We present exact solutions
for the four-alphabets and three-alphabets sequences. We also demonstrate that a mapping of the
given sequence into a smaller-alphabets sequence (namely, a coarse-graining process) does not
necessarily imply that long-range correlations found in the latter would correspond to those of the
former.

I. INTRODUCTION



A standard method for studying the stochastic behavior of complex physical, chemical or 
even biological systems consists first in dividing the set of all states available to them into a 
finite number of distinct classes, labelled by distinct symbols, and then in representing the 
dynamical evolution of these systems as sequences of these symbols [1,2,3]. Statistical corre- 
lation between the symbols of this representative sequence, if any, will then tell us about the 
way the original system evolves. In particular, the presence of long-range correlations (LRC) 
in the symbolic sequence will suggest a history-dependence of the system's evolution. 

If on the other hand these sequences represent the states available to the various parts of 
the system under study, then their statistical properties could also throw light on the way 
these subsystems are organized. For example, natural language texts are made up of words 
put together according to certain syntactic rules; so, when they are treated as sequences 
of alphabets, their syntactic structure and semantic content should manifest as correlations 
between the alphabets. Since the rules for putting words together do not extend beyond 
a sentence, we may expect the syntactic structure of the text to show up as short-range 
correlations, whereas we may expect the semantic content of the text to show up as long- 
range correlations between the alphabets. 

Conversely, it is of interest to know whether long-range correlations present in the representative
symbolic sequence always imply a non-local or global behavior of the original
system and if so, whether they can also provide a parametric description. Reducing the number
of alphabets or symbols, called coarse-graining of the sequence, is equivalent to reducing
the number of parameters in the problem; such a reduction of parameters is not expected to
affect long-range correlations that might be present in the sequence but may however lead
to a loss of short-range correlations. Therefore, if the interest is in studying the non-local or
global behavior of the system that shows up as long-range correlation between symbols of
its representative sequence, then we may as well group the symbols into two distinct classes
and represent the system in terms of a binary sequence.

Recently, it has been suggested [4] that non-trivial correlations in a binary sequence
could be due to the presence of a long-range memory whose strength exceeds a certain
critical value. More specifically, the number of zeros, $\mathcal{N}_0$, in a given binary sequence
$S_N = \{\sigma_i = 0,1 \mid i = 1,\ldots,N\}$ is expected to be close to $N/2$ if the sequence $S_N$ were long
and unbiassed. It is, in fact, a random variable gaussian-distributed around its expected
mean value, $N/2$, with a variance denoted by $\sigma^2(N)$. It has been shown by Keshet and
Hod [4] that the existence of non-trivial correlations in the sequence can be inferred from
the anomalous power-law dependence of $\sigma^2(N)$ on $N$, namely $\sigma^2(N) \sim N^{2\mu}$, where $\mu$, the
strength of long-range memory, exceeds the critical value 1/2. This mean-field-like theory
of correlated binary sequences seems to provide a paradigm for studying the correlational
properties of generic symbolic sequences such as DNA and even natural language texts.

The implicit assumption in this approach is that the many-alphabets sequence represent- 
ing an LRC system can always be coarse-grained into a binary sequence without losing the 
long-range correlational properties of the original sequence. There is no a priori reason why 
this assumption should be expected to hold good in general. For example, it has been argued 
[5] that a minimum of ten letters (not less than five letters, in any case) are needed for de- 
signing a foldable model of amino acid (twenty-alphabets) sequences. This example suggests 
that the minimum number of symbols required for a sequential representation of a system 
(or equivalently, the extent to which the state-space of a system can be coarse-grained) may 
depend on the specific behaviour of the system under study. 

In the next section, we present a mean-field-like theory of an L-alphabets sequence with 
long-range memory, which is a generalization of our earlier study on ternary sequences [6]. In
the third section, we work out the exact phase diagrams for the special cases of four-symbol 
and three-symbol problems. We also show that a mapping of the ternary sequence to the 
binary sequence could lead to spurious correlations. Whether this result holds good for a 
generic symbolic sequence is a moot question because the formulation presented here deals 
with non-stationary sequences. We summarize the results in the last section. 

II. MANY-ALPHABETS SEQUENCES 

In order to study the statistical properties of such sequences, we first need to define the
conditional probability, $p(t_i \in \alpha \mid T_{N,i})$, that the $i$th symbol, $t_i$, in the sequence will be one
of $\alpha = \{0, 1, 2, \ldots, L-1\}$, where $T_{N,i}\,(= t_{i-N}\, t_{i-N+1} \cdots t_{i-1})$ denotes the earlier part of the
sequence under consideration. Depending on whether $T_{N,i}$ denotes only a part or the whole
of the earlier sequence, we may say that the sequence being studied has finite or infinite
memory. We are concerned here with sequences with infinite or unbounded memory. The
conditional probability, $p(t_i \in \alpha \mid T_{N,i})$, could in general depend not only on the individual
symbols but also on their specific positional ordering in the sequence. Ignoring the possible
configuration-dependence of this conditional probability leads to a solvable mean-field-like
theory of these sequences.

We first define the conditional probability, $p(t_i = 0 \mid T_{N,i})$, of finding zero as the $i$th
symbol in the sequence as follows:

$$ p(t_i = 0 \mid T_{N,i}) = \frac{1}{N} \sum_{j=0}^{L-1} n_j g_j \tag{1} $$

where the $n_j$'s denote the numbers of $j$'s in the sequence such that $\sum_{j=0}^{L-1} n_j = N$, and the $g_j$'s denote
the a priori probabilities of choosing the respective symbols; only $L-1$ of the symbols are
independent because the a priori probabilities for the symbols will have to add up to unity.

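The conditional probabilities, $p(t_i = j \mid T_{N,i})$, for the rest of the symbols, $j = 1, 2, \ldots, L-1$,
to be found at the $i$th position of the sequence may be defined [6] in terms of the complementary
sequences $T^{j}_{N,i}\,(= t^{j}_{i-N}\, t^{j}_{i-N+1} \cdots t^{j}_{i-1})$, where $t^{j}_{i-k} = (t_{i-k} + j) \pmod{L}$ for $k = 1, 2, \ldots, N$:

$$ p(t_i = j \mid T_{N,i}) = p(t_i = 0 \mid T^{j}_{N,i}) $$

$$ = \frac{1}{N} \sum_{l=0}^{L-1} n_l\, g_{(j+l) \bmod L} \tag{2} $$

$$ = \frac{1}{N} \sum_{l=0}^{L-1} n_{(L-j+l) \bmod L}\, g_l \tag{3} $$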

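Clearly, these definitions ensure that $\sum_j p(t_i = j \mid T_{N,i}) = 1$. We may now parametrize
the deviations of $g_l$ from their 'unbiassed' values $1/L$ by writing $g_l = (1 + \mu_l)/L$ where
$\mu_{l \in \alpha} \in [-1, L-1]$ and $\sum_{l \in \alpha} \mu_l = 0$, and then reexpress the conditional probabilities in
terms of the independent symbols, $\{0, 1, 2, \ldots, L-2\}$:

$$ p(t_i = 0 \mid T_{N,i}) = \frac{1}{L} + \frac{1}{NL} \sum_{m=0}^{L-2} \mu_m M_m \tag{4} $$

$$ p(t_i = j \mid T_{N,i}) = \frac{1}{L} + \frac{1}{NL} \sum_{l=0}^{L-2} \mu_l\, \eta_l ; \quad j = 1, 2, \ldots, L-1 \tag{5} $$

$$ M_m = \left[ 2 n_m + \sum_{k=0\,(\neq m)}^{L-2} n_k \right] - N \tag{6} $$

$$ \eta_l = n_{(L-j+l) \bmod L} - n_{(L-j-1)} \tag{7} $$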
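As a numerical cross-check (an illustration, not part of the original derivation), the following Python sketch evaluates the conditional probabilities in two ways: directly from Eq.(2), and from the parametrized form with only $L-1$ independent $\mu$'s. The function names `p_direct` and `p_from_mu` and the sample counts are hypothetical choices made for this example; the parametrized form assumes $\eta_l$ as in Eq.(7), which for $j = 0$ coincides with $M_l$ of Eq.(6).

```python
def p_direct(n, g):
    """Eq. (2): p_j = (1/N) * sum_l n_l * g_{(j+l) mod L}."""
    L, N = len(g), sum(n)
    return [sum(n[l] * g[(j + l) % L] for l in range(L)) / N for j in range(L)]


def p_from_mu(n, mu):
    """Parametrized form: p_j = 1/L + (1/(N L)) * sum_l mu_l * eta_l, len(mu) = L-1."""
    L, N = len(n), sum(n)
    probs = []
    for j in range(L):
        # eta_l = n_{(L-j+l) mod L} - n_{L-j-1}; for j = 0 this equals M_l
        eta = [n[(L - j + l) % L] - n[L - j - 1] for l in range(L - 1)]
        probs.append(1 / L + sum(m * e for m, e in zip(mu, eta)) / (N * L))
    return probs


n = [5, 3, 2, 6]                      # hypothetical symbol counts, N = 16
mu = [0.4, -0.2, 0.1]                 # independent parameters; mu_3 = -sum(mu)
g = [(1 + m) / 4 for m in mu + [-sum(mu)]]
pa, pb = p_direct(n, g), p_from_mu(n, mu)
```

Both lists agree to rounding error and sum to unity, which is the normalization stated in the text.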
We may also denote these conditional probabilities by $p_j(n_0, n_1, \ldots, n_{L-2}; N)$ in order to
show the independent symbols and variables explicitly. We may simplify the notations
further by defining $\mathbf{n} = (n_0, n_1, \ldots, n_{L-2})$ and $\mathbf{n}_j - 1 = (n_0, n_1, \ldots, n_j - 1, \ldots, n_{L-2})$.
Accordingly, $p(t_i = j \mid T_{N,i})$ could be denoted by $p_j(\mathbf{n}; N)$.

Now, a sequence of $N$ symbols is completely described by the probability, $Q(\mathbf{n}; N)$, that
there are $n_0$ zeros, $n_1$ ones and so on in the sequence. We may write the following discrete
equation for $Q(\mathbf{n}; N)$:

$$ Q(\mathbf{n}; N+1) = \sum_{j=0}^{L-2} p_j(\mathbf{n}_j - 1; N)\, Q(\mathbf{n}_j - 1; N) + \left( 1 - \sum_{j=0}^{L-2} p_j(\mathbf{n}; N) \right) Q(\mathbf{n}; N) \tag{8} $$
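The growth process behind Eq.(8) can be sampled directly. The sketch below is one possible reading of Eqs.(1)-(3), not the authors' code: pick an earlier symbol $l$ uniformly at random (probability $n_l/N$), draw an index $m$ from the a priori probabilities, and append $(m - l) \bmod L$, so that symbol $j$ follows with probability $g_{(j+l) \bmod L}$. The name `generate_sequence`, the uniformly drawn first symbol, and the sample values of `g` are assumptions made for illustration.

```python
import random


def generate_sequence(g, length, seed=0):
    """Sample a sequence whose next symbol follows Eqs. (1)-(3).

    g      : a priori probabilities g_0..g_{L-1} (assumed to sum to 1)
    length : number of symbols to generate (>= 1)
    The first symbol is drawn uniformly; the text does not fix this choice.
    """
    L = len(g)
    rng = random.Random(seed)
    seq = [rng.randrange(L)]
    for _ in range(length - 1):
        l = rng.choice(seq)                      # earlier symbol, prob n_l / N
        m = rng.choices(range(L), weights=g)[0]  # index drawn from g
        seq.append((m - l) % L)                  # gives P(j | l) = g_{(j+l) mod L}
    return seq


seq = generate_sequence([0.5, 0.3, 0.2], 1000)
counts = [seq.count(s) for s in range(3)]        # one sample of (n_0, n_1, n_2)
```

Repeating the run over many seeds and histogramming `counts` would give an empirical estimate of $Q(\mathbf{n}; N)$.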

Since we expect the average number of any symbol in the sequence to be close to its asymptotic
value, $N/L$, we may rewrite the above equation in terms of the variables, $x_j = L n_j - N$
for all $j = 0, 1, \ldots, L-2$. In doing so, we make use of the correspondence,

$$ (\mathbf{n}; N+1) \leftrightarrow (x_0 - 1, x_1 - 1, \ldots, x_{L-2} - 1; N+1) \equiv (\mathbf{X} - 1; N+1) $$

$$ (\mathbf{n}_j - 1; N) \leftrightarrow (x_0, x_1, \ldots, x_j - L, \ldots, x_{L-2}; N) \equiv (\mathbf{X}_j - L; N) $$

and rewrite Eq.(8) in the form,

$$ Q(\mathbf{X} - 1; N+1) = \sum_{j=0}^{L-2} p_j(\mathbf{X}_j - L; N)\, Q(\mathbf{X}_j - L; N) + \left( 1 - \sum_{j=0}^{L-2} p_j(\mathbf{X}; N) \right) Q(\mathbf{X}; N) \tag{9} $$

The transition probabilities, $p_j(\mathbf{X}; N)$, are given by

$$ p_j(\mathbf{X}; N) = \frac{1}{L} \left( 1 + \frac{1}{N} f_j(\mathbf{X}, \boldsymbol{\lambda}) \right); \quad j = 0, 1, \ldots, L-2 \tag{10} $$

where the 'drift' forces, $f_j$'s, are defined as follows:

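$$ f_0(\mathbf{X}, \boldsymbol{\lambda}) = \sum_{l=0}^{L-2} \lambda_l x_l \tag{11} $$

$$ f_1(\mathbf{X}, \boldsymbol{\lambda}) = -\lambda_0 x_{L-2} + \sum_{l=0}^{L-3} (\lambda_{l+1} - \lambda_0)\, x_l \tag{12} $$

$$ f_{j \geq 2}(\mathbf{X}, \boldsymbol{\lambda}) = -\lambda_{j-1} x_{L-j-1} + \sum_{l=0}^{L-j-2} (\lambda_{l+j} - \lambda_{j-1})\, x_l + \sum_{l=L-j}^{L-2} (\lambda_{l+j-L} - \lambda_{j-1})\, x_l \tag{13} $$

$$ \lambda_m = \frac{1}{L} \left( 2\mu_m + \sum_{k=0\,(\neq m)}^{L-2} \mu_k \right) \tag{14} $$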
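The drift forces can be checked against the original definition of the conditional probabilities. The following Python sketch (an illustration; the helper names `lambdas` and `lrc_matrix` and the sample numbers are hypothetical) builds the coefficient rows of $f_0, \ldots, f_{L-2}$ from Eqs.(11)-(14) and verifies that $p_j = (1/L)(1 + f_j/N)$ of Eq.(10) reproduces Eq.(2) for a sample set of counts.

```python
def lambdas(mu):
    """Eq. (14): lambda_m = (2*mu_m + sum_{k != m} mu_k) / L, with len(mu) = L-1."""
    L = len(mu) + 1
    s = sum(mu)
    return [(m + s) / L for m in mu]


def lrc_matrix(mu):
    """Row j holds the coefficients of x_0..x_{L-2} in f_j, Eqs. (11)-(13)."""
    lam = lambdas(mu)
    L = len(mu) + 1
    rows = [lam[:]]                               # f_0, Eq. (11)
    for j in range(1, L - 1):
        row = []
        for l in range(L - 1):
            if l == L - j - 1:
                row.append(-lam[j - 1])           # the isolated -lambda_{j-1} term
            else:
                row.append(lam[(l + j) % L] - lam[j - 1])
        rows.append(row)
    return rows


n = [5, 3, 2, 6]                                  # hypothetical counts, N = 16
mu = [0.4, -0.2, 0.1]                             # mu_3 = -sum(mu)
L, N = len(n), sum(n)
g = [(1 + m) / L for m in mu + [-sum(mu)]]
x = [L * n[l] - N for l in range(L - 1)]          # x_l = L*n_l - N
f = [sum(a * b for a, b in zip(row, x)) for row in lrc_matrix(mu)]
p_drift = [(1 + fj / N) / L for fj in f]          # Eq. (10)
p_exact = [sum(n[l] * g[(j + l) % L] for l in range(L)) / N for j in range(L - 1)]
```

The rows returned by `lrc_matrix` are the same coefficient rows that make up the matrix whose eigenvalues control the asymptotic variance later in this section.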



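For a long chain ($N \gg 1$), the hopping process described by Eq.(9) becomes more transparent
in the continuum limit:

$$ \frac{\partial Q(\mathbf{X}; N)}{\partial N} = \frac{1}{2}(L-1) \sum_{j=0}^{L-2} \frac{\partial^2 Q(\mathbf{X}; N)}{\partial x_j^2} - \frac{1}{N + N_0} \sum_{j=0}^{L-2} \frac{\partial \left[ f_j(\mathbf{X}, \boldsymbol{\lambda})\, Q(\mathbf{X}; N) \right]}{\partial x_j} \tag{15} $$

The parameter $N_0$ has been introduced as a cutoff below which the above continuum version
of Eq.(9) would not be meaningful. It could depend on the memory parameters, $\mu_l$'s [4].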

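In order to solve this equation subject to the initial condition, $Q(\mathbf{X}; N = 0) = \delta(\mathbf{X})$, we
first Fourier transform it with respect to the variables $x_j$'s:

$$ \frac{\partial Q(\mathbf{q}; N)}{\partial N} = \sum_{j=0}^{L-2} \left[ -\frac{1}{2}(L-1)\, q_j^2\, Q(\mathbf{q}; N) + \frac{1}{N + N_0}\, f_j(\mathbf{q}; \boldsymbol{\lambda})\, \frac{\partial Q(\mathbf{q}; N)}{\partial q_j} \right] \tag{16} $$

Here, $q_j$'s are the Fourier conjugates of the $x_j$'s and the $f_j(\mathbf{q}, \boldsymbol{\lambda})$'s are obtained from Eqs.(11-14)
by substituting $q_l$'s for $x_l$'s. This first order equation can then be solved by the method
of characteristics. In particular, we have to solve the equations,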

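$$ \frac{dN}{N + N_0} = -\frac{dq_0}{f_0(\mathbf{q}, \boldsymbol{\lambda})} = -\frac{dq_1}{f_1(\mathbf{q}, \boldsymbol{\lambda})} = \cdots = -\frac{dq_{L-2}}{f_{L-2}(\mathbf{q}, \boldsymbol{\lambda})} \tag{17} $$

The standard method of solving them leads to rewriting Eq.(16) in terms of the normal
coordinates, $q^{(l)}$'s [7]: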

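$$ \frac{\partial Q(\mathbf{q}; N)}{\partial N} = \sum_{l=0}^{L-2} \left[ -\frac{1}{2}(L-1) \left( q^{(l)} \right)^2 Q(\mathbf{q}; N) + \frac{\rho^{(l)}}{N + N_0}\, q^{(l)}\, \frac{\partial Q(\mathbf{q}; N)}{\partial q^{(l)}} \right] \tag{18} $$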



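where the $\rho^{(l)}$'s are the eigenvalues of the 'LRC' matrix,

$$ \Lambda = \begin{pmatrix}
\lambda_0 & \lambda_1 & \lambda_2 & \cdots & \lambda_{L-3} & \lambda_{L-2} \\
\lambda_1 - \lambda_0 & \lambda_2 - \lambda_0 & \lambda_3 - \lambda_0 & \cdots & \lambda_{L-2} - \lambda_0 & -\lambda_0 \\
\lambda_2 - \lambda_1 & \lambda_3 - \lambda_1 & \lambda_4 - \lambda_1 & \cdots & -\lambda_1 & \lambda_0 - \lambda_1 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
\lambda_{L-3} - \lambda_{L-4} & \lambda_{L-2} - \lambda_{L-4} & -\lambda_{L-4} & \cdots & \lambda_{L-6} - \lambda_{L-4} & \lambda_{L-5} - \lambda_{L-4} \\
\lambda_{L-2} - \lambda_{L-3} & -\lambda_{L-3} & \lambda_0 - \lambda_{L-3} & \cdots & \lambda_{L-5} - \lambda_{L-3} & \lambda_{L-4} - \lambda_{L-3}
\end{pmatrix} \tag{19} $$

A further transformation,

$$ z^{(l)} = q^{(l)}\, \tau^{\rho^{(l)}}; \quad \tau = N + N_0; \quad l = 0, 1, \ldots, L-2 \tag{20} $$

leads to the simpler form,

$$ \frac{\partial Q}{\partial \tau} = \left[ -\frac{1}{2}(L-1) \sum_{l=0}^{L-2} \left( z^{(l)} \right)^2 \tau^{-2\rho^{(l)}} \right] Q \tag{21} $$

whose solution can be written down immediately. Transforming back to the variables, $q^{(l)}$,
and using the initial condition, $Q(\mathbf{q}; \tau = N_0) = 1$, we have,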



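$$ Q(\mathbf{q}; \tau) = \exp \left\{ -\frac{1}{2}(L-1) \sum_{l=0}^{L-2} \frac{\left( q^{(l)} \right)^2 \tau}{(1 - 2\rho^{(l)})} \left[ 1 - \left( \frac{N_0}{\tau} \right)^{-2\rho^{(l)}+1} \right] \right\} \tag{22} $$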



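Since the maximum eigenvalue, $\rho = \max_l \rho^{(l)}$, decides the asymptotic behaviour of $Q$, the
variance associated with the number of any given symbol turns out to be

$$ \sigma^2(\tau) = \frac{(L-1)\, \tau}{(1 - 2\rho)} \left[ 1 - \left( \frac{N_0}{N + N_0} \right)^{-2\rho+1} \right] \tag{23} $$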

It is clear from the above expression that $\sigma^2(\tau) \propto \tau^{\nu}$, where $\nu = 1$ whenever $2\rho < 1$ and
$\nu = 2\rho$ whenever $2\rho > 1$. Even in the case $2\rho = 1$, the exponent $\nu = 1$ but there will be
logarithmic corrections. The existence of a critical value, $\rho_c = 1/2$, for $\rho$ implies that the
$(L-1)$-dimensional parameter space, $\{\mu_l \mid -1 \leq \mu_l \leq (L-1);\ l = 0, 1, \ldots, L-2\}$, is divided
into two distinct regions, diffusive ($\nu = 1$) and superdiffusive ($\nu > 1$). In the next section,
we present exact solutions to the special cases, ternary and quaternary sequences.
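The crossover between the two regimes can be seen numerically. The sketch below evaluates the variance of Eq.(23) and estimates its local log-log slope at large $N$; it assumes the $(L-1)$ prefactor implied by Eq.(22), takes $N_0 = 1$ purely for illustration, and handles the marginal case $2\rho = 1$ by its logarithmic limit. The function names are hypothetical.

```python
import math


def variance(N, rho, N0=1.0, L=2):
    """Eq. (23) with tau = N + N0; the (L - 1) prefactor is taken from Eq. (22)."""
    tau = N + N0
    if abs(1 - 2 * rho) < 1e-12:             # marginal case: logarithmic correction
        return (L - 1) * tau * math.log(tau / N0)
    return (L - 1) * tau * (1 - (N0 / tau) ** (1 - 2 * rho)) / (1 - 2 * rho)


def growth_exponent(rho):
    """nu = 1 for 2*rho <= 1 (diffusive), nu = 2*rho otherwise (superdiffusive)."""
    return max(1.0, 2 * rho)


def local_slope(rho, N=1e12):
    """Numerical d(log variance)/d(log N), estimated between N and 2N."""
    return math.log(variance(2 * N, rho) / variance(N, rho)) / math.log(2)


slope_diffusive = local_slope(0.3)           # close to 1
slope_super = local_slope(0.8)               # close to 2 * 0.8 = 1.6
```

For $\rho$ close to $1/2$ the slope approaches its asymptotic value only very slowly, which is the numerical signature of the logarithmic corrections mentioned above.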



III. SPECIAL CASES 



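A. Quaternary (four-symbols) sequence:

There are only three independent correlation parameters, $\mu_0, \mu_1, \mu_2 \in [-1, 3]$, for a quaternary
sequence which define the $\lambda$'s, Eq.(14):

$$ \lambda_0 = \frac{1}{4}(2\mu_0 + \mu_1 + \mu_2), \quad \lambda_1 = \frac{1}{4}(\mu_0 + 2\mu_1 + \mu_2), \quad \lambda_2 = \frac{1}{4}(\mu_0 + \mu_1 + 2\mu_2) \tag{24} $$

The corresponding LRC matrix is then given by Eq.(19),

$$ \Lambda = \begin{pmatrix}
\lambda_0 & \lambda_1 & \lambda_2 \\
\lambda_1 - \lambda_0 & \lambda_2 - \lambda_0 & -\lambda_0 \\
\lambda_2 - \lambda_1 & -\lambda_1 & \lambda_0 - \lambda_1
\end{pmatrix} \tag{25} $$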



FIG. 1: The diffusive part of the phase space associated with the quaternary sequence. It is slanted
and ellipsoidal in shape, cut by the coordinate planes as well as by the plane $\mu_0 + \mu_1 + \mu_2 = 1$.

whose eigenvalues are given by

$$ \rho_0 = \lambda_0 - \lambda_1 + \lambda_2 = \frac{1}{2}(\mu_0 + \mu_2) $$

$$ \rho_1 = \sqrt{(\lambda_0 - \lambda_2)^2 + \lambda_1^2} = \frac{1}{2\sqrt{2}} \left[ (\mu_0 + \mu_1)^2 + (\mu_1 + \mu_2)^2 \right]^{1/2} \tag{26} $$

Since $\rho_1$ is always positive, the maximum value between $\rho_0$ and $\rho_1$ decides which part of
the $(\mu_0, \mu_1, \mu_2)$-parameter space is diffusive and which part is superdiffusive. The diffusive
part of the available phase space is shown in Fig. 1. It is slanted and ellipsoidal in shape,
cut by the bounding planes. Its projections on the $(\mu_0, \mu_1)$, $(\mu_1, \mu_2)$ and $(\mu_0, \mu_2)$ planes are
shown in Fig. 2. Also shown in the figure are the projections of the available phase space
containing the truncated ellipsoidally shaped diffusive region.

FIG. 2: The projections of the diffusive region shown in Fig. 1 on the three coordinate planes.
The triangular regions shown in lighter hue are the projections of the available phase space prism
containing the diffusive region shown in Fig. 1.

B. Ternary (three-symbols) sequence [6]:

There are only two independent parameters, $\mu_0, \mu_1 \in [-1, 2]$, for the ternary sequence
and the corresponding LRC matrix,

$$ \Lambda = \begin{pmatrix}
\lambda_0 & \lambda_1 \\
\lambda_1 - \lambda_0 & -\lambda_0
\end{pmatrix} \tag{27} $$

has a positive eigenvalue given by

$$ \rho = \sqrt{\lambda_0^2 - \lambda_0 \lambda_1 + \lambda_1^2} \tag{28} $$

where $\lambda_0 = (2\mu_0 + \mu_1)/3$ and $\lambda_1 = (\mu_0 + 2\mu_1)/3$. The diffusive region of the $(\mu_0, \mu_1)$-parameter
space is then defined by the condition $\rho < 1/2$ and corresponds to the elliptical
region shown in Fig. 3. The area of this diffusive region is $\pi\sqrt{3}/2$, which is roughly 60% of
the available phase space area. This may be contrasted with the binary case where it is
exactly 50%. Now the question arises whether a diffusive subregion of the ternary is likely
to be mapped into a superdiffusive region of the binary due to a coarse-graining process.
Since a mapping of a set of three symbols to a set of two symbols always introduces a global
bias in the system, we will have to reformulate the binary case accordingly.
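The 60% figure can be reproduced numerically. The sketch below is illustrative only: it takes the positive eigenvalue of the ternary LRC matrix as obtained from Eq.(19) with $L = 3$, namely $\rho = (\lambda_0^2 - \lambda_0\lambda_1 + \lambda_1^2)^{1/2}$, and assumes the available phase space is the triangle $\mu_0 \geq -1$, $\mu_1 \geq -1$, $\mu_0 + \mu_1 \leq 1$ of area 9/2. The function names and the grid resolution are arbitrary choices.

```python
import math


def rho_ternary(mu0, mu1):
    """Positive eigenvalue of [[l0, l1], [l1 - l0, -l0]] (Eq. (19) with L = 3)."""
    l0, l1 = (2 * mu0 + mu1) / 3, (mu0 + 2 * mu1) / 3
    return math.sqrt(l0 * l0 - l0 * l1 + l1 * l1)


def diffusive_fraction(steps=600):
    """Fraction of the triangle mu0, mu1 >= -1, mu0 + mu1 <= 1 with rho < 1/2."""
    h = 3.0 / steps
    inside = total = 0
    for i in range(steps):
        for k in range(steps):
            mu0, mu1 = -1 + (i + 0.5) * h, -1 + (k + 0.5) * h  # cell midpoints
            if mu0 + mu1 <= 1:
                total += 1
                inside += rho_ternary(mu0, mu1) < 0.5
    return inside / total


analytic = (math.pi * math.sqrt(3) / 2) / 4.5    # ellipse area / triangle area, about 0.605
numeric = diffusive_fraction()
```

In terms of the $\mu$'s the condition $\rho < 1/2$ reads $\mu_0^2 + \mu_0\mu_1 + \mu_1^2 < 3/4$, an ellipse that touches all three sides of the triangle.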

FIG. 3: Phase diagram for the ternary sequence. L1 and L2 refer to the lines $\mu_0 + \mu_1 = 1$ and
$\mu_0 + \mu_1 = 1/4$ respectively. The elliptical region of the triangular phase space corresponds to the
diffusive behavior whereas the region exterior to it corresponds to the superdiffusive behavior of the
ternary sequences. The entire region between the lines L1 and L2, inclusive of the shaded elliptical
region, gets mapped into the superdiffusive regime of the binary.

C. Mapping the ternary into the binary:

The a priori probabilities assigned to the symbols of the ternary sequence are

$$ g_i = \frac{1}{3}(1 + \mu_i); \quad \sum_{i=0}^{2} \mu_i = 0; \quad i = 0, 1, 2 \tag{29} $$

where the memory parameters, $\mu_i$'s, have their values in the range $[-1, 2]$ and satisfy the
above constraint. If we replace, for example, the symbol one of the ternary by symbol zero,
we have a binary sequence of symbols, zero and one, with the a priori probabilities,

$$ g_0 = \frac{2}{3} \left( 1 - \frac{1}{2}\mu_2 \right); \quad g_1 = \frac{1}{3}(1 + \mu_2) \tag{30} $$

where the constraint given in Eq.(29) has been made use of. The above definitions are of the
general form,

$$ g_1 = b(1 + \mu_2); \quad g_0 = b' \left( 1 - \frac{b}{b'}\mu_2 \right) \tag{31} $$

with $b = 1/3$ and $b' = (1 - b) = 2/3$. This implies that a ternary sequence without memory
($\mu_0 = \mu_1 = \mu_2 = 0$ in Eq.(29)) corresponds to a binary sequence without memory ($\mu_2 = 0$ in
Eq.(30)) but consisting of unequal numbers of zeros and ones. This is at variance with what
is implied by Eq.(1) which provides the simplest mechanism for having a binary sequence
with unbounded memory.
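The coarse-graining of the a priori probabilities can be written out in a few lines. In this Python sketch (function names are hypothetical), ternary symbols zero and one are merged into binary zero, ternary two becomes binary one, and the result is compared with the general form of Eq.(31) for $b = 1/3$.

```python
def ternary_priors(mu0, mu1):
    """Eq. (29): g_i = (1 + mu_i)/3 with mu_2 = -(mu_0 + mu_1)."""
    mu = [mu0, mu1, -(mu0 + mu1)]
    return [(1 + m) / 3 for m in mu]


def coarse_grain(g):
    """Merge ternary symbols 0 and 1 into binary 0; ternary 2 becomes binary 1."""
    return [g[0] + g[1], g[2]]


def general_form(mu2, b=1 / 3):
    """Eq. (31): g_0 = b'(1 - (b/b') mu_2), g_1 = b(1 + mu_2), with b' = 1 - b."""
    bp = 1 - b
    return [bp * (1 - (b / bp) * mu2), b * (1 + mu2)]


g3 = ternary_priors(0.3, 0.3)          # sample point with mu_2 = -0.6
g2 = coarse_grain(g3)                  # binary priors after the mapping
memoryless = coarse_grain(ternary_priors(0.0, 0.0))   # biased even without memory
```

The memoryless case gives unequal binary priors, 2/3 and 1/3, which is the global bias discussed in the text.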

In fact, Eq.(1) defines the conditional probability, $p_0(n_0, N)$, of finding zero as the $(N+1)$th
symbol of the sequence that already has $n_0$ zeros:

$$ p_0(n_0, N) = \frac{1}{N}(n_0 g_0 + n_1 g_1) = \frac{1}{2} \left( 1 + \frac{(2g_0 - 1)}{N} [2n_0 - N] \right) \tag{32} $$

It is clear from this definition that $(2g_0 - 1)$ parametrizes the memory-dependence for the
occurrence of a symbol in the sequence. A sequence without memory corresponds to the
unbiassed case ($g_0 = 1/2 = g_1$) and, on the average, has equal numbers of zeros and ones.
In general, for arbitrary values of $g_0$, the mean difference between the number of zeros
and the number of ones has recently been shown [8] to have an asymptotic form,

$$ \langle \delta n \rangle \equiv \langle n_0 - n_1 \rangle \sim \frac{(2q - 1)}{\Gamma(2g_0)}\, N^{2g_0 - 1} \tag{33} $$

where $q$ is the a priori probability of having the first symbol of the sequence as zero. For
$g_0 < 1/2$, the mean difference, $\langle \delta n \rangle$, vanishes even if the first symbol were chosen with a
biassed coin. In other words, the average number of zeros tends to be equal to the average
number of ones for $g_0 < 1/2$.

This implies that Eq.(32) is not the right definition for $p_0(n_0, N)$ because we expect the
average number of zeros, $\langle n_0 \rangle$, in the sequence to be equal to $b'N$ for $\mu_2 > (1 - 2b)/2b$
(equivalently, for $g_0 < 1/2$). Since the natural variable for a Markov chain generated with a
biassed coin is $x$ ($= [n_0/b' - N]$, say), it is likely that the conditional probability $p_0(n_0, N)$
depends on $n_0$ through the variable $x$. We therefore try the ansatz,

$$ p_0(x, N) = a + \frac{\beta}{N}\, x \tag{34} $$

$$ \beta = b'(2g_0 - 1) \tag{35} $$

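where $a$ is a parameter yet to be fixed. The probability, $Q(x, N)$, that the number of zeros
in a chain of length $N$ is $n_0$ can now be shown to satisfy the following equation in the
continuum limit:

$$ \frac{\partial Q}{\partial N} = \frac{1}{2b'^2} \left[ b'^2 + a(1 - 2b') \right] \frac{\partial^2 Q}{\partial x^2} + \left( 1 - \frac{a}{b'} \right) \frac{\partial Q}{\partial x} - \frac{\beta}{b'(N + N_0)} \frac{\partial [xQ]}{\partial x} \tag{36} $$

where, as mentioned earlier, $N_0$ denotes a cutoff value for $N$ below which the above continuum
description is not valid.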

This equation suggests that an interesting description from the LRC's point of view
corresponds to the case when $a = b'$ [9], leading to the following general definition for
$p_0(n_0, N)$:

$$ p_0(n_0, N) = b' \left( 1 + \frac{(2g_0 - 1)}{N} \left[ \frac{n_0}{b'} - N \right] \right) \tag{37} $$

which reduces to that given by Eq.(32) for $b' = 1/2$. The first moment of the distribution,
$\langle x \rangle$, then turns out to have the interesting form, $\langle x \rangle \sim (N + N_0)^{2g_0 - 1}$, while the
variance of the distribution is given by,

$$ \sigma^2 \sim \begin{cases} N & \text{for } (2g_0 - 1) < 1/2 \\ N^{2(2g_0 - 1)} & \text{for } (2g_0 - 1) > 1/2 \end{cases} $$
In other words, long-range correlations in a binary sequence are characterized by the exponent,
$(2g_0 - 1)$, for $g_0 > 3/4$ which in turn corresponds to having the values of $\mu_2$ in the
range $\mu_2 \in (-1, -1 + [1/4b])$ by Eq.(31).

Since $\mu_2 = -(\mu_0 + \mu_1)$ and $b = 1/3$, the condition for non-trivial correlations in the
coarse-grained binary sequence turns out to be $\mu_0 + \mu_1 > 1/4$, above the lower line shown in
Fig. 3. The entire region of the phase space between this line and the upper line, $\mu_0 + \mu_1 = 1$,
corresponds to long-range correlation when mapped into the biassed binary.

Defining the memory parameter, $\mu$, for the binary sequence simply as $\mu = -\mu_2/2$, we
see that the phase space region, $1/4 < (\mu_0 + \mu_1) < 1$, of the ternary sequence corresponds
to the parameter range, $1/8 < \mu < 1/2$, of the biassed binary sequence; every line parallel
to and in between the upper and lower lines in Fig. 3 is mapped into a point in the range
$1/8 < \mu < 1/2$ and vice versa. In particular, the shaded part of the ellipse in Fig. 3 is the
diffusive area that is mapped into the superdiffusive regime of the biassed binary, which in
general is characterized by a parameter, $\mu$, in the range $(-1 + [3/4b'], -1 + [1/b'])$; for
$b' = 2/3$ this is the range $(1/8, 1/2)$ quoted above.

Conversely, the fact that the correlation parameter of the biassed binary sequence under
study has a value in the range $\mu \in (-1 + [3/4b'], -1 + [1/b'])$ does not necessarily mean that
the parent ternary sequence also has long-range correlations. Even if we know a priori that
there are long-range correlations between symbols of the parent sequence, the parameter $\mu$
is not a true measure of its strength.
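A single point of the shaded region makes the argument concrete. The Python sketch below (hypothetical function names; the point $\mu_0 = \mu_1 = 0.3$ is an arbitrary choice between the two lines of Fig. 3) tests the ternary diffusivity condition $\rho < 1/2$, written as $\mu_0^2 + \mu_0\mu_1 + \mu_1^2 < 3/4$, and computes the variance exponent of the coarse-grained binary from $g_0$ of Eq.(30).

```python
def ternary_is_diffusive(mu0, mu1):
    """rho < 1/2 is equivalent to mu0^2 + mu0*mu1 + mu1^2 < 3/4."""
    return mu0 * mu0 + mu0 * mu1 + mu1 * mu1 < 0.75


def binary_exponent(mu0, mu1):
    """Variance exponent of the coarse-grained binary: max(1, 2(2 g_0 - 1))."""
    g0 = (2 + mu0 + mu1) / 3           # Eq. (30) with mu_2 = -(mu_0 + mu_1)
    return max(1.0, 2 * (2 * g0 - 1))


mu0 = mu1 = 0.3                        # lies between the lines L1 and L2 of Fig. 3
spurious = ternary_is_diffusive(mu0, mu1) and binary_exponent(mu0, mu1) > 1
```

Here the ternary sequence is diffusive while the coarse-grained binary has a variance exponent of $22/15 \approx 1.47$, i.e. the long-range correlation seen in the binary is an artefact of the mapping.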

IV. SUMMARY 

We have presented a formalism for studying the correlation properties of a many-symbols 
sequence and, by way of illustration, obtained the phase diagrams for quaternary and ternary 
sequences. Motivated by the fact that the diffusive portion of the phase space is larger for the 
ternary than for the binary, we have studied a mapping between the two. It turns out that 
long-range correlation for the binary does not necessarily imply long-range correlation for the 
ternary. This result has deeper implications for the coarse-graining of the many-alphabets 
sequence. For example, if we do not know a priori that there are long-range correlations in 
the original sequence, then we cannot conclude that long-range correlation found in a coarse- 
grained sequence implies that in the original sequence. On the other hand, even if we know 
a priori that there are long-range correlations in the original sequence, we cannot conclude 
that the correlation parameters of the coarse-grained sequence represent the true strength of
correlations in the parent sequence, if we do not know how the coarse-grained sequence had 
been obtained. It is also likely that different coarse-graining paths from the parent sequence 
to lower dimensional representative sequence could lead to different inferences [10]. The 
applicability of this result to a generic symbolic sequence needs further investigation since 
it is obtained in a formalism dealing with non-stationary sequences. 
SLN wishes to acknowledge helpful discussions with Y. S. Mayya. 